## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
I started with quality (the output variable) to understand the overall distribution of wines. The univariate graph reveals a roughly normal distribution of wine quality, with a low score (min) of 3, a mean of 5.6, and a high score (max) of 8. I also created a factor set in order to group the wines as Poor (0-4), Average (5-6), and Good (7-10).
Next, I shifted to analyzing the input variables, starting with the alcohol content. This graph reveals a right-skewed distribution, with the mean and median sitting just above 10% alcohol by volume (10.42 and 10.2, respectively).
## Warning: position_stack requires constant width: output may be incorrect
Then I looked at the potassium sulphate content (which contributes to SO2). After adjusting the binwidth to an appropriate level for the variable, the distribution appeared slightly right-skewed, with some outliers extending beyond 1 all the way out to 2.0. The mean and median are 0.658 and 0.620, but the maximum of 2.0 stretches the right tail.
Two other related variables, free sulfur dioxide and total sulfur dioxide (SO2), follow the same right-skewed distribution after adjusting the binwidths to an appropriate width. This makes sense: logically, we would expect these variables to be correlated, and thus to follow a similar distribution.
Next was pH, which measures acidity from 0 (very acidic) to 14 (very basic). The distribution of our wines appears to be roughly normal, with the middle 50% of our sample wines falling between 3.21 and 3.40 on the pH scale (the 1st and 3rd quartiles).
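To illustrate how the choice of binwidth changes what a histogram shows, here is a minimal base R sketch. The `sulphates` vector below is a small hypothetical sample, not the real data, and the bin widths (0.5 and 0.25) are chosen only for illustration.

```r
# Hypothetical sulphate readings (NOT the real data), with two high outliers.
sulphates <- c(0.33, 0.45, 0.55, 0.56, 0.62, 0.65, 0.73, 0.80, 1.20, 2.00)

# Coarse bins of width 0.5 versus finer bins of width 0.25.
# plot = FALSE returns the bin counts without drawing anything.
coarse <- hist(sulphates, breaks = seq(0, 2, by = 0.5), plot = FALSE)
fine   <- hist(sulphates, breaks = seq(0, 2, by = 0.25), plot = FALSE)

print(coarse$counts)
print(fine$counts)
```

The finer bins separate the main cluster from the long right tail, which is the effect described above. In ggplot2 the analogous control is the `binwidth` argument, as the `stat_bin` messages in this report suggest.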
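As a quick check on how to read the quartiles in the summary output: the 1st and 3rd quartiles bracket roughly the middle half of the observations. The sketch below uses a hypothetical, evenly spaced `ph` vector rather than the real measurements.

```r
# Hypothetical pH values (NOT the real data): 101 evenly spaced readings.
ph <- round(seq(2.8, 3.8, by = 0.01), 2)

# 1st and 3rd quartiles, as reported by summary().
q <- quantile(ph, probs = c(0.25, 0.75))

# Share of observations lying between the two quartiles (inclusive).
share_inside <- mean(ph >= q[1] & ph <= q[2])

print(q)
print(share_inside)
```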
## Warning: position_stack requires constant width: output may be incorrect
Then I looked at two other variables related to pH: Fixed Acidity and Volatile Acidity. At first glance, both appear slightly right-skewed, whereas pH itself looked roughly symmetric. When I adjusted the binwidth of Volatile Acidity further, I revealed a slight bimodal distribution, with peaks around .4 and .65.
Then I looked at citric acid content, which ranges in our dataset from 0 to 1, with a right-skewed distribution. It would appear that 0 is actually our most common value, with other peaks around .25 and .5. The mean and median are .271 and .260, respectively, with a max of 1.0.
Then I looked at residual sugar. It’s a pretty clear right-skewed distribution, with a mean of 2.54 but a max of 15.50. The middle 50% of the data falls between 1.90 and 2.60; the outliers are what produce such a strong right skew.
## Warning: position_stack requires constant width: output may be incorrect
Finally I looked at the chlorides (amount of salt in the wine). Chlorides had a very similar output to residual sugar, with a strong right skew caused by some clear outliers. The middle 50% of the data fell between .07 and .09 (median of .08), but the maximum value is .61. The max and other outliers pull the mean up to .087, very close to the 3rd quartile value. Because of the strong right skew, I changed the graph to a log10 view of chlorides, which gives a more nearly normal distribution.
The Red Wine dataset contains 1,599 red wines with 11 attributes describing the chemical properties of the wine, plus the resulting quality rating. The quality output comes from the median rating of at least 3 wine experts, with values from 0 (very bad) to 10 (very excellent).
I am interested to see how different features impact the quality rating of red wines. In particular, I am eager to see if & how volatile acidity and citric acid levels can help indicate/predict red wine quality.
I believe alcohol, pH, fixed acidity, chlorides, residual sugar and sulfur dioxide levels could also have an effect on wine quality. I think acidity levels are likely the main indicator, but will evaluate the impacts of these other features as well.
I created a factor set for quality. Instead of the values 3, 4, 5, etc., I created a grouping in order to look at the wine quality more holistically as Poor (0-4), Average (5-6), and Good (7-10). Later in my analyses (multivariate) I again created a factor set for quality - Below Average (0-5), Above Average (6-10) - in order to better delineate wine behavior.
I made the distribution of chlorides look closer to normal by plotting log10 of chlorides instead of the raw values.
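One way such groupings can be built in R is with `cut()`. This is a minimal sketch under assumptions: the `quality` vector is a hypothetical stand-in for the quality column, and the variable names `quality_group` and `quality_binary` are illustrative, not necessarily the ones used in the analysis.

```r
# Hypothetical quality scores; the real values would come from the data set.
quality <- c(3, 4, 5, 5, 6, 6, 7, 8)

# Three-level grouping: Poor (0-4), Average (5-6), Good (7-10).
# cut() uses right-closed intervals by default: (-Inf, 4], (4, 6], (6, 10].
quality_group <- cut(quality,
                     breaks = c(-Inf, 4, 6, 10),
                     labels = c("Poor", "Average", "Good"),
                     ordered_result = TRUE)

# Two-level grouping: Below Average (0-5), Above Average (6-10).
quality_binary <- cut(quality,
                      breaks = c(-Inf, 5, 10),
                      labels = c("Below Average", "Above Average"),
                      ordered_result = TRUE)

print(table(quality_group))
print(table(quality_binary))
```

Using an ordered factor keeps the levels in a sensible order in tables and plot legends.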
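The effect of a log10 transformation on a right-skewed variable can be sketched as follows. The `chlorides` values are hypothetical, and `skewness()` is a small helper defined here for illustration (a simple moment-based estimate), not a function from the original analysis.

```r
# Hypothetical chloride readings (NOT the real data), with one large outlier.
chlorides <- c(0.05, 0.07, 0.075, 0.08, 0.09, 0.10, 0.61)

# Illustrative helper: moment-based sample skewness (positive = right skew).
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / (mean(d^2)^1.5)
}

skew_raw <- skewness(chlorides)
skew_log <- skewness(log10(chlorides))

print(c(raw = skew_raw, log10 = skew_log))
```

On this toy sample the log10 values are still right-skewed, but less so than the raw values. In ggplot2, `scale_x_log10()` applies the same transformation at the axis level.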
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## Warning: position_stack requires constant width: output may be incorrect
In order to get a full understanding of the relationship among the data set’s variables in a single view, I created a scatterplot matrix. From this view, there are a few highlights that caught my eye:
Citric Acid - There is a negative correlation with pH and positive correlations with sulphates and density
Volatile Acidity - There is a negative correlation between volatile acidity and quality as well as citric acid.
Alcohol - There is a positive correlation between alcohol and quality, and negative correlations with density, chlorides, and volatile acidity
Sulphates - Negative correlation with chlorides
Quality - The key output variable has a positive correlation with alcohol and sulphates, along with a negative correlation with volatile acidity and Total SO2.
The strongest correlation in this data set was the negative one between pH and fixed acidity.
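The pairwise correlations behind a scatterplot matrix can also be read off numerically with `cor()`. The sketch below uses a tiny hypothetical data frame `toy` (five made-up rows with three of the column names from this data set), so the numbers it prints are illustrative only.

```r
# Hypothetical rows (NOT the full data set).
toy <- data.frame(
  fixed.acidity = c(7.4, 7.8, 11.2, 7.9, 7.3),
  pH            = c(3.51, 3.20, 3.16, 3.30, 3.39),
  alcohol       = c(9.4, 9.8, 9.8, 9.4, 10.0)
)

# Pearson correlation matrix (the default method of cor()).
cor_mat <- cor(toy)

# Locate the strongest off-diagonal correlation by absolute value.
off_diag <- cor_mat
diag(off_diag) <- 0
strongest <- which(abs(off_diag) == max(abs(off_diag)), arr.ind = TRUE)[1, ]

print(round(cor_mat, 2))
print(rownames(cor_mat)[strongest])
```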
Next I started to look further into the various relationships between variables.
I looked at citric acid against pH. I added a line to the scatterplot in order to view the median pH over citric acid levels, revealing the negative correlation between the two variables (albeit not overwhelmingly strong).
Plotting citric acid against wine quality appears to yield no discernible relationship. I now turn to other variables in my dataset in order to help predict quality.
I next investigated volatile acidity. The relationship between volatile acidity and citric acid appears to be a negative correlation until the citric acid level reaches .5, at which point it appears to trend slightly positively.
Looking at the relationship between volatile acidity and quality, I can see the negative correlation. It appears that the higher quality wines tend to have a lower, more concentrated volatile acidity (very few outliers among the Good wines).
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.400 9.725 9.925 9.955 10.580 11.000
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 9.60 10.00 10.27 11.00 13.10
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.5 9.4 9.7 9.9 10.2 14.9
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.80 10.50 10.63 11.30 14.00
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.80 11.50 11.47 12.10 14.00
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.80 11.32 12.15 12.09 12.88 14.00
Looking at alcohol, I first analysed alcohol and volatile acidity. There doesn’t appear to be a strong relationship between the two. I then shifted focus to alcohol and wine quality. Based on this graph (with the additional median quality line on the scatterplot), one can see the positive correlation between alcohol level and wine quality. By running a summary of median alcohol content by quality levels, one can further see the values backing up the graph. The median alcohol content in a wine with a quality level of 3 is 9.925, while the alcohol content of an 8 quality wine is 12.15.
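A per-group median like the one quoted above can be computed with `tapply()`. This is a minimal sketch on hypothetical values; the vectors `quality` and `alcohol` stand in for the corresponding columns of the data set.

```r
# Hypothetical values (NOT the real data).
quality <- c(3, 3, 5, 5, 5, 8, 8)
alcohol <- c(9.0, 10.0, 9.4, 9.7, 10.2, 11.5, 12.5)

# Median alcohol content within each quality level.
median_by_quality <- tapply(alcohol, quality, median)

print(median_by_quality)
```

Replacing `median` with `summary` in a call to `by()` gives a per-group six-number summary similar in layout to the output shown above.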
Next I looked at a few additional variables against quality that didn’t seem to lead to any strong relationships. I can see a slight positive correlation between quality and sulphates.
I can also notice a slight negative correlation between Total SO2 levels and wine quality.
Plotting quality against residual sugar yielded little insight, merely that low sugar levels are consistent across the various levels of quality.
From my bivariate plots I was able to uncover different feature relationships and non-relationships within my data set. One of my key features of interest in affecting quality was citric acid. My plot of citric acid and quality was disappointing in that it revealed no strong relationship.
However, volatile acidity did prove to have a visible negative correlation with quality.
The other interesting relationship I noticed was alcohol content and quality. Not expecting the alcohol content to be an indicator of quality, I was surprised to see a very clear positive relationship between the two features. Looking at the median alcohol content at each quality level (rather than mean in order to mitigate any outlier impact), one can see a large difference: 9.9 for lowest quality up to 12.2 for the highest.
Judging purely by correlation, the strongest relationship in this data set is between pH and fixed acidity, which is very much expected. As pH is a measure of the acidity in wine, and fixed acidity is a part of that, the correlation supports our general knowledge and assumptions.
## Warning: Removed 15 rows containing missing values (stat_smooth).
## Warning: Removed 14 rows containing missing values (stat_smooth).
## Warning: Removed 34 rows containing missing values (geom_point).
It took countless hours to reach this point, but it was during the multivariate analysis section that I decided to revisit my quality factor set. Instead of three levels, I tried to simplify to show Below Average (0-5) and Above Average (6-10) quality wines. After creating this feature and regraphing my subsequent plots, the relationships between variables became much clearer.
Starting with alcohol content and volatile acidity, I was able to graph the relationship of those two features with the new quality score. In an attempt to eliminate the effect of outliers, I looked only at the bottom 99% of data points and am able to see a relatively clear delineation between the above and below average wines. The wines with a lower alcohol content and slightly higher volatile acidity seem to rate lower in quality, while the wines with a higher alcohol content and lower volatile acidity appear to rate higher in quality.
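Restricting a plot to the bottom 99% of each variable can be sketched with `quantile()`. The data frame `toy` below is hypothetical (99 typical rows plus one extreme row), and subsetting the data is only one possible approach; setting axis limits at the 99th percentile in ggplot2 is another, and would be consistent with the "Removed ... rows" warnings in this report.

```r
# Hypothetical data (NOT the real data): 99 typical wines and one outlier.
toy <- data.frame(
  alcohol          = c(rep(10, 99), 15),
  volatile.acidity = c(rep(0.5, 99), 1.6)
)

# 99th percentile cut-offs for each variable.
alcohol_cap <- quantile(toy$alcohol, 0.99)
va_cap      <- quantile(toy$volatile.acidity, 0.99)

# Keep only rows below both cut-offs.
trimmed <- subset(toy, alcohol < alcohol_cap & volatile.acidity < va_cap)

print(nrow(toy))
print(nrow(trimmed))
```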
## Warning: Removed 32 rows containing missing values (geom_point).
Next I plotted the relationship between alcohol and sulphates with a color layer for wine quality. Once again, a rather clear behavior can be seen for Below Average wines vs. Above Average wines. One can see a clear cluster for the Below Average wine (low alcohol content, lower sulphates) vs. the Above Average wine (with higher alcohol content and slightly higher sulphates).
## Warning: Removed 34 rows containing missing values (geom_point).
When graphing the relationship between volatile acidity and sulphates, with the quality overlay, once again a clear difference in behavior can be seen between the Below Average and Above Average wines. In this case, the Below Average wines tend to have higher levels of volatile acidity and just slightly lower levels of sulphates, whereas the Above Average wines have a lower volatile acidity and slightly higher sulphate levels.
## Warning: Removed 32 rows containing missing values (geom_point).
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
## Warning: Removed 20 rows containing missing values (stat_smooth).
## Warning: Removed 12 rows containing missing values (stat_smooth).
## Warning: Removed 32 rows containing missing values (geom_point).
Then I wanted to look at the ratio of volatile acidity to fixed acidity against density, and the impact on wine quality. From the first plot, it appears the Below Average wines have a higher ratio (volatile:fixed), with Above Average wines having a lower ratio, which further supports the previous graph showing simple volatile acidity. The density doesn’t appear to vary greatly between the quality groups. Once I added a smooth line for the median, it became clearer that, holding the acidity ratio constant, the Above Average wines have a lower density than the Below Average wines.
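The acidity ratio described here is a derived feature. A minimal sketch of how it might be computed and compared across quality groups follows; the data frame `toy`, the column name `acidity_ratio`, and the group labels are all hypothetical.

```r
# Hypothetical wines (NOT the real data).
toy <- data.frame(
  fixed.acidity    = c(8, 10, 8, 10),
  volatile.acidity = c(0.8, 0.5, 0.4, 0.3),
  quality_group    = c("Below", "Below", "Above", "Above")
)

# Derived feature: volatile acidity divided by fixed acidity.
toy$acidity_ratio <- toy$volatile.acidity / toy$fixed.acidity

# Median ratio within each quality group.
ratio_by_group <- tapply(toy$acidity_ratio, toy$quality_group, median)

print(ratio_by_group)
```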
## Warning: Removed 13 rows containing missing values (stat_smooth).
## Warning: Removed 17 rows containing missing values (stat_smooth).
## Warning: Removed 31 rows containing missing values (geom_point).
For my final multivariate plot, I graphed the ratio of volatile acidity to fixed acidity against alcohol content, again overlaying the data with the wine quality. In this graph, we see that our Above Average quality wine has a higher alcohol content and lower acidity ratio, while the Below Average wine has a lower alcohol content and higher acidity ratio.
##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = wine)
## m2: lm(formula = quality ~ alcohol + sulphates, data = wine)
## m3: lm(formula = quality ~ alcohol + sulphates + total.sulfur.dioxide,
## data = wine)
## m4: lm(formula = quality ~ alcohol + sulphates + total.sulfur.dioxide +
## chlorides, data = wine)
## m5: lm(formula = quality ~ alcohol + sulphates + total.sulfur.dioxide +
## chlorides + volatile.acidity + volatile.acidity:fixed.acidity,
## data = wine)
##
## ==============================================================================================
##                                             m1          m2          m3          m4          m5
## ----------------------------------------------------------------------------------------------
## (Intercept)                           1.875***    1.375***    1.650***    1.970***    2.942***
##                                        (0.175)     (0.177)     (0.185)     (0.192)     (0.206)
## alcohol                               0.361***    0.346***    0.329***    0.302***    0.282***
##                                        (0.017)     (0.016)     (0.017)     (0.017)     (0.017)
## sulphates                                         0.994***    1.025***    1.282***    0.892***
##                                                    (0.102)     (0.102)     (0.110)     (0.111)
## total.sulfur.dioxide                                         -0.003***   -0.003***   -0.002***
##                                                                (0.001)     (0.001)     (0.001)
## chlorides                                                                -2.324***   -1.746***
##                                                                            (0.405)     (0.392)
## volatile.acidity                                                                     -1.438***
##                                                                                        (0.166)
## volatile.acidity x fixed.acidity                                                        0.042*
##                                                                                        (0.019)
## ----------------------------------------------------------------------------------------------
## R-squared                                0.227       0.270       0.280       0.295       0.353
## adj. R-squared                           0.226       0.269       0.279       0.293       0.351
## sigma                                    0.710       0.690       0.686       0.679       0.651
## F                                      468.267     294.988     207.177     166.754     145.058
## p                                        0.000       0.000       0.000       0.000       0.000
## Log-likelihood                       -1721.057   -1675.142   -1663.543   -1647.155   -1577.954
## Deviance                               805.870     760.894     749.934     734.719     673.799
## AIC                                   3448.114    3358.284    3337.085    3306.310    3171.908
## BIC                                   3464.245    3379.793    3363.971    3338.573    3214.925
## N                                         1599        1599        1599        1599        1599
## ==============================================================================================
I also attempted to create a linear model to predict wine quality, based on a set of features available in the data set. Starting with alcohol, I added sulphates, total SO2, chlorides, and volatile acidity together with its interaction with fixed acidity (the volatile.acidity:fixed.acidity term, which is a product rather than a ratio). Beginning with an R^2 value (which helps to identify goodness of fit) of .227, I was able to increase my R^2 value to .353. This value, unfortunately, does not indicate a strong fit (a strong fit would be close to 1.0). I did, however, test this model against my data set and was able to predict the correct quality value, using a 95% confidence interval.
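The following sketch shows the general shape of fitting such a model with `lm()`, reading off R^2, and requesting an interval from `predict()`. It runs on a hypothetical eight-row data frame `toy`, so its coefficients and R^2 have nothing to do with the values in the table above. It also illustrates that, in R's formula syntax, `a:b` adds the product of the two variables as a predictor.

```r
# Hypothetical wines (NOT the real data).
toy <- data.frame(
  quality          = c(5, 5, 5, 6, 6, 6, 7, 7),
  alcohol          = c(9.4, 9.8, 10.0, 10.5, 11.0, 11.5, 12.0, 12.5),
  volatile.acidity = c(0.70, 0.88, 0.66, 0.60, 0.50, 0.58, 0.40, 0.35),
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.9, 7.3, 7.5)
)

# Linear model with an interaction term (a:b is the product a * b).
m <- lm(quality ~ alcohol + volatile.acidity + volatile.acidity:fixed.acidity,
        data = toy)

# The design matrix column for the interaction equals the elementwise product.
mm <- model.matrix(m)

# Share of variance explained by the model.
r2 <- summary(m)$r.squared

# Fitted value with a 95% confidence interval for a hypothetical new wine.
# Use interval = "prediction" for an interval on an individual rating instead.
new_wine <- data.frame(alcohol = 10.2, volatile.acidity = 0.6, fixed.acidity = 8.0)
pred <- predict(m, newdata = new_wine, interval = "confidence", level = 0.95)

print(r2)
print(pred)
```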
During my multivariate analysis, I was able to observe the impact of different feature relationships on the quality rating of red wines. Based on my bivariate analysis I was able to take some of the features that had appeared to impact quality and combine those with other features that appeared to have a correlation to each other.
As one example, alcohol and sulphates have a positive correlation both with each other and with quality. Plotting those two features together, along with wine quality, one can see that as both variables increase, the output (quality) generally increases as well. Holding sulphate content constant, quality is generally Above Average when alcohol content increases.
I think what surprised me the most was the overlap in my Above Average and Below Average quality wines. While clear distinctions could be seen in my feature graphs (sulphates vs. volatile acidity, alcohol vs. chlorides, etc.), there are still many wines that defy the general trends. This surprising discovery also impacted my attempt to create a model to predict wine quality (see below).
Yes, I created a linear model using alcohol, volatile acidity and its interaction with fixed acidity, sulphates, chlorides, and total SO2.
This model only explains 35% of variance in quality of red wines, which was disappointing. I think part of the trouble in predicting wine quality based on this set of features is that we’re attempting to use quantitative data to predict a subjective result. There are additional limitations in the data set, discussed in my reflection, that also hinder my attempt to create a proper model.
The distribution of alcohol content between Below Average and Above Average wines is clearly distinct. Below Average wines peak at ~9.5% alcohol content with a tight distribution, while Above Average wines are more spread out, but peak at ~11% alcohol content.
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
## Warning: Removed 43 rows containing missing values (stat_smooth).
## Warning: Removed 52 rows containing missing values (stat_smooth).
## Warning: Removed 99 rows containing missing values (geom_point).
Looking at the relationship between pH and sulphates and the impact on red wine quality, there appears to be a distinction, although not a well-defined one. Holding pH constant, it appears Above Average wines have a higher sulphate content. The median smooth lines added to the graph allow one to see the distinction; however, the underlying points on the graph reveal a great overlap of the Above and Below Average wines.
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
## Warning: Removed 6 rows containing missing values (stat_smooth).
## Warning: Removed 16 rows containing missing values (stat_smooth).
## Warning: Removed 7 rows containing missing values (stat_smooth).
The Red Wine data set contained details on ~1600 wines. I had to begin my analysis by understanding the various features within my data set, in order to understand their relationships with other variables as well as with quality. Once I understood the variables at play, I worked to graph and analyze the relationships in order to predict quality output of wine. Eventually I created a linear model to do just that, including alcohol, volatile acidity and its interaction with fixed acidity, chlorides, sulphates, and total sulfur dioxide. However, I was disappointed in the low R^2 value that resulted. Based on my analyses I find that alcohol content and volatile acidity likely have the greatest impact on predicting wine quality. A higher alcohol level, coupled with a lower volatile acidity, appears to result in a higher quality wine.
There are many limitations to this data set. The wines included all come from the same region: Portugal. Given the variety of wines from all over the world, this limits any ability to extrapolate beyond Portugal. The data set also doesn’t include certain key features of the wine that I believe (as a self-proclaimed wine connoisseur) would have a great impact on the wine quality: grape type, winery, region, and year. The third and likely greatest limitation is that the bulk of the wines in this data set fall within the Average quality range, with limited low and high quality wines. This unbalanced data set causes great difficulty in trying to understand and predict what inputs create an excellent wine. For any further analysis, I would love to be able to bulk up the existing data set with global wines and with the additional data elements that I feel are key to being able to predict Red Wine quality.